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Abstract 

Additive regression trees are flexible non- 
parametric models and popular off-the-shelf 
tools for real-world non-linear regression. In 
application domains, such as bioinformat¬ 
ics, where there is also demand for proba¬ 
bilistic predictions with measures of uncer¬ 
tainty, the Bayesian additive regression trees 
(BART) model, introduced by Chipman et al. 
(2010), is increasingly popular. As data sets 
have grown in size, however, the standard 
Metropolis-Hastings algorithms used to per¬ 
form inference in BART are proving inade¬ 
quate. In particular, these Markov chains 
make local changes to the trees and suffer 
from slow mixing when the data are high¬ 
dimensional or the best-fitting trees are more 
than a few layers deep. We present a novel 
sampler for BART based on the Particle Gibbs 
(PG) algorithm (Andrieu et al., 2010) and 
a top-down particle filtering algorithm for 
Bayesian decision trees (Lakshminarayanan 
et al., 2013). Rather than making local 
changes to individual trees, the PG sampler 
proposes a complete tree to fit the residual. 
Experiments show that the PG sampler out¬ 
performs existing samplers in many settings. 

1 Introduction 

Ensembles of regression trees are at the heart of many 
state-of-the-art approaches for nonparametric regres¬ 
sion (Caruana and Niculescu-Mizil, 2006), and can be 
broadly classified into two families: randomized inde¬ 
pendent regression trees, wherein the trees are grown 
independently and predictions are averaged to reduce 
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variance, and additive regression trees, wherein each 
tree fits the residual not explained by the remainder of 
the trees. In the former category are bagged decision 
trees (Breiman, 1996), random forests (Breiman, 2001), 
extremely randomized trees (Geurts et al., 2006), and 
many others, while additive regression trees can be 
further categorized into those that are fit in a serial 
fashion, like gradient boosted regression trees (Fried¬ 
man, 2001), and those fit in an iterative fashion, like 
Bayesian additive regression trees (BART) (Chipman 
et al., 2010) and additive groves (Sorokina et al., 2007). 

Among additive approaches, BART is extremely pop¬ 
ular and has been successfully applied to a wide vari¬ 
ety of problems including protein-DNA binding, credit 
risk modeling, automatic phishing/spam detection, and 
drug discovery (Chipman et ah, 2010). Additive re¬ 
gression trees must be regularized to avoid overfitting 
(Friedman, 2002): in BART, over-fitting is controlled 
by a prior distribution preferring simpler tree struc¬ 
tures and non-extreme predictions at leaves. At the 
same time, the posterior distribution underlying BART 
delivers a variety of inferential quantities beyond predic¬ 
tions, including credible intervals for those predictions 
as well as a measures of variable importance. At the 
same time, BART has been shown to achieve predictive 
performance comparable to random forests, boosted 
regression trees, support vector machines, and neural 
networks (Chipman et al., 2010). 

The standard inference algorithm for BART is an iter¬ 
ative Bayesian backfitting Markov Chain Monte Carlo 
(MCMC) algorithm (Hastie et al., 2000). In particular, 
the MCMC algorithm introduced by Chipman et al. 
(2010) proposes local changes to individual trees. This 
sampler can be computationally expensive for large 
datasets, and so recent work on scaling BART to large 
datasets (Pratola et al., 2013) considers using only a 
subset of the moves proposed by Chipman et al. (2010). 
However, this smaller collection of moves has been ob¬ 
served to lead to poor mixing (Pratola, 2013) which 
in turn produces an inaccurate approximation to the 
posterior distribution. While a poorly mixing Markov 
chain might produce a reasonable prediction in terms 
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of mean squared error, BART is often used in scenarios 
where its users rely on posterior quantities, and so there 
is a need for computationally efficient samplers that 
mix well across a range of hyper-parameter settings. 

In this work, we describe a novel sampler for BART 
based on ( 1 ) the Particle Gibbs (PG) framework pro¬ 
posed by Andrieu et al. (2010) and ( 2 ) the top-down 
sequential Monte Garlo algorithm for Bayesian deci¬ 
sion trees proposed by Lakshminarayanan et al. (2013). 
Loosely speaking, PG is the particle version of the 
Gibbs sampler where proposals from the exact condi¬ 
tional distributions are replaced by conditional ver¬ 
sions of a sequential Monte Carlo (SMC) algorithm. 
The complete sampler follows the Bayesian backfitting 
MCMC framework for BART proposed by Chipman 
et al. ( 2010 ); the key difference is that trees are sam¬ 
pled using PG instead of the local proposals used by 
Chipman et al. (2010). Our sampler, which we re¬ 
fer to as PG-BART, approximately samples complete 
trees from the conditional distribution over a tree fit¬ 
ting the residual. As the experiments bear out, the 
PG-BART sampler explores the posterior distribution 
more efficiently than samplers based on local moves. 
Of course, one could easily consider non-local moves in 
a Metropolis-Bastings (MB) scheme by proposing com¬ 
plete trees from the tree prior, however these moves 
would be rejected, leading to slow mixing, in high¬ 
dimensional and large data settings. The PG-BART 
sampler succeeds not only because non-local moves are 
considered, but because those non-local moves have 
high posterior probability. Another advantage of the 
PG sampler is that it only requires one to be able to 
sample from the prior and does not require evaluation 
of tree prior in the acceptance ratio unlike (local) MB— 
hence PG can be computationally efficient in situations 
where the tree prior is expensive (or impossible) to 
compute, but relatively easier to sample from. 

The paper is organized as follows: in section 2, we 
review the BART model; in section 3, we review the 
MCMC framework proposed by Chipman et al. (2010) 
and describe the PG sampler in detail. In section 4, 
we present experiments that compare the PG sampler 
to existing samplers for BART. 

2 Model and notation 

In this section, we briefly review decision trees and 
the BART model. We refer the reader to the paper 
of Chipman et al. (2010) for further details about the 
model. Our notation closely follows their’s, but also 

The tree prior term cancels out in the MB acceptance 
ratio if complete trees are sampled. Bowever, sampling 
complete trees from the tree prior would lead to very low 
acceptance rates as discussed earlier. 


borrows from Lakshminarayanan et al. (2013). 

2.1 Problem setup 

We assume that the training data consist of N i.i.d. sam¬ 
ples X = where S K^, along with corre¬ 

sponding labels Y = {yn}n=i^ where ?/„ e K. We focus 
only on the regression task in this paper, although the 
PG sampler can also be used for classification by build¬ 
ing on the work of Chipman et al. (2010) and Zhang 
and Bardie (2010). 

2.2 Decision tree 

For our purposes, a decision tree is a hierarchical binary 
partitioning of the input space with axis-aligned splits. 
The structure of the decision tree is a finite, rooted, 
strictly binary tree T, i.e., a finite collection of nodes 
such that 1 ) every node 77 has exactly one parent node, 
except for a distinguished root node e which has no 
parent, and 2 ) every node rj is the parent of exactly 
zero or two children nodes, called the left child 77 O and 
the right child 77 I. Denote the leaves of T (those nodes 
without children) by leaves(T). Each node of the tree 
77 S T is associated with a block C of the input 
space as follows: At the root, we have B^ = K^, while 
each internal node 77 S T \ leaves(T) with two children 
represents a split of its parent’s block into two halves, 
with Kjj S {1 ,..., D} denoting the dimension of the 
split, and denoting the location of the split. In 
particular, 

Brjo = n {2 e < Tr,} and 

Brjl = Brj n {z G > Tr,}. (1) 

We call the tuple T = (T, k, t) a decision tree. Note 

that the blocks associated with the leaves of the tree 
form a partition of . 

A decision tree used for regression is referred to as 
a regression tree. In a regression tree, each leaf node 
77 G leaves(T) is associated with a real-valued parameter 
G K. Let pi = {/ir)}?)eieaves(T) denote the collection 
of all parameters. Given a tree Y and a data point 
X, let Ieaf 7 -(a;) be the unique leaf node 77 G leaves(T) 
such that X G Brf, and let g{ ■ ;T,pi) be the response 
function associated with T and pi, given by 

g{x,f~,pL) . 'f-(^x) • (^) 

2.3 Likelihood specification for BART 

BART is a sum-of-trees model, i.e., BART assumes that 
the label y for an input x is generated by an additive 
combination of m regression trees. More precisely, 

m 

y = + e, (3) 

1=1 
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where e ^ A/’(0, cr^) is an independent Gaussian noise 
term with zero mean and variance Hence, the 
likelihood for a training instance is 

m 

= J^{y\g{x; a^), 

and the likelihood for the entire training dataset is 

m{Tj, }"Li, X) = n €(y„I {T, , }”Li, 

n 

2.4 Prior specification for BART 

The parameters of the BART model are the noise 
variance and the regression trees (Tj , fJ^j ) for j = 
1,..., TO. The conditional independencies in the prior 
are captured by the factorization 

m 

i=i 

The prior over decision trees p{Tj = {T^, , Tj}|X) 

can be described by the following generative process 
(Chipman et ah, 2010; Lakshminarayanan et ah, 2013): 
Starting with a tree comprised only of a root node e, 
the tree is grown by deciding once for every node p 
whether to 1 ) stop and make p a leaf, or 2 ) split, making 
p an internal node, and add pO and pi as children. The 
same stop/split decision is made for the children, and 
their children, and so on. Let pri be a binary indicator 
variable for the event that p is split. Then every node 
p is split independently with probability 

(1 + depth(?7))^= 

X 1 [valid split exists below 77 in X], (4) 

where the indicator ![...] forces the probability to be 
zero when every possible split of p is invalid, i.e., one 
of the children nodes contains no training data. Infor¬ 
mally, the hyperparameters Og S ( 0 , 1 ) and /3s G [ 0 ,oo) 
control the depth and number of nodes in the tree. 
Higher values of Og lead to deeper trees while higher 
values of /3g lead to shallower trees. 

In the event that a node p is split, the dimension 
and location of the split are assumed to be drawn 
independently from a uniform distribution over the set 
of all valid splits of p. The decision tree prior is thus 

p(T|X) = p/pr, = l)U{Krj)U{Trf\K^) 

?7GT\leaves(T) 

X n 

?7Gleaves(T) 

Note that p{pn = 1) depends on X and the split dimen¬ 
sions and locations at the ancestors of 77 in T due to the 
indicator function for valid splits. We elide this dependence 
to keep the notation simple. 


where f^(-) denotes the probability mass function of the 
uniform distribution over dimensions that contain at 
least one valid split, and denotes the probability 

density function of the uniform distribution over valid 
split locations along dimension Krj in block 

Given a decision tree T, the parameters associated with 
its leaves are independent and identically distributed 
normal random variables, and so 

P{fj-\T)= A'(/x^|TO^,cr2). 

?7Gleaves(T) 

The mean and variance cr^ hyperparameters are 
set indirectly: Ghipman et al. (2010) shift and rescale 
the labels Y such that ymin = —0.5 and 7/max = 0.5, 
and set to^ = 0 and cr^ = 0.5/ky/m, where fc > 0 is 
an hyperparameter. This adjustment has the effect 
of keeping individual node parameters pri small; the 
higher the values of k and to, the greater the shrinkage 
towards the mean m^. 

The prior p{cf‘^) over the noise variance is an inverse 
gamma distribution. The hyperparameters v and q 
indirectly control the shape and rate of the inverse 
gamma prior over Ghipman et al. (2010) compute 
an overestimate of the noise variance e.g., using the 
least-squares variance or the unconditional variance of 
Y, and, for a given shape parameter v, set the rate 
such that Pr(cr <a)=q, i.e., the gth quantile of the 
prior over a is located at tr. 

Ghipman et al. (2010) recommend the default values: 
V = 3,q = 0.9, k = 2, m = 200 and Og = 0.95,/3g = 
2.0. Unless otherwise specified, we use this default 
hyperparameter setting in our experiments. 

2.5 Sequential generative process for trees 

Lakshminarayanan et al. (2013) present a sequential 
generative process for the tree prior p(T|X), where 
a tree T is generated by starting from an empty tree 
7 (0) E^nd sampling a sequence T{i),T{ 2 ),... of partial 
trees. This sequential representation is used as the 
scaffolding for their SMC algorithm. Due to space 
limitations, we can only briefly review the sequential 
process. The interested reader should refer to the 
paper by Lakshminarayanan et al. (2013): Let T{t) = 
T(t), K/t), T(t), i?(t) denote the partial tree at stage t, 
where E(^^^ denotes the ordered set containing the list 
of nodes eligible for expansion at stage t. At stage t, 
the generative process samples T(t) from Prt(-1 7(t_i)) 
as follows: the first element of E, say p, is popped and 
is stochastically assigned to be an internal node or a 
leaf node with probability given by (4). If p is chosen 

Note that Tjt) denotes partial tree at stage t, whereas 
Tj denotes the jith tree in the ensemble. 
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(a) 7^, 


(0)- -C'(O) 


= {e} (b) Tay. 71(1) = {0.1} (c) 7^2): 71(2) = {1} (d) % 


(3)- ^{3) 


= { 10 , 11 } 


(e) 7(6): 71(6) — {} 


Figure 1: Sequential generative process for decision trees: Nodes eligible for expansion are denoted by the ordered set E 
and shaded in gray. At iteration 0, we start with the empty tree and E = {e}. At iteration 1, we pop e from E and assign 
it to be an internal node with split dimension = 1 and split location = 0.5 and append the child nodes 0 and 1 to E. 
At iteration 2, we pop 0 from E and set it to a leaf node. At iteration 3, we pop 1 from E and set it to an internal node, 
split dimension ki = 2 and threshold ri = 0.3 and append the child nodes 10 and 11 to 71. At iterations 4 and 5 (not 
shown), we pop nodes 10 and 11 respectively and assign them to be leaf nodes. At iteration 6, 71 = {} and the process 
terminates. By arranging the random variables p and k, t (if applicable) for each node in the order of expansion, the tree 
can be encoded as a sequence. 


to be an internal node, we sample the split dimension 
Kri and split location r,, uniformly among the valid 
splits, and append 77 O and 77 I to E. Thus, the tree is 
expanded in a breadth-wise fashion and each node is 
visited just once. The process terminates when E is 
empty. Figure 1 presents a cartoon of the sequential 
generative process. 

3 Posterior inference for BART 

In this section, we briefly review the MCMC framework 
proposed in (Chipman et ah, 2010), discuss limitations 
of existing samplers and then present our PG sampler. 

3.1 MCMC for BART 

Given the likelihood and the prior, our goal is to com¬ 
pute the posterior distribution 

i,a2,X) p({7}-,M,}”Li,a2|X). (7) 

Ghipman et al. (2010) proposed a Bayesian backfitting 
MGMC to sample from the BART posterior. At a 
high level, the Bayesian backfitting MGMG is a Gibbs 
sampler that loops through the trees, sampling each 
tree 7) and associated parameters fjij conditioned on 
cr^ and the remaining trees and their associated param¬ 
eters {Tj', and samples cr^ conditioned on all 

the trees and parameters {Tj, Let 

and respectively denote the values of Tj,fXj and 
at the MCMC iteration. Sampling condi¬ 
tioned on {Tj, is straightforward due to conju- 

gacy. To sample Tj , fXj conditioned on the other trees 

Lakshminarayanan et al. (2013) discuss a more general 
version where more than one node may be expanded in 
an iteration. We restrict our attention in this paper to 
node-wise expansion: one node is expanded per iteration 
and the nodes are expanded in a breadth-wise fashion. 


{Tj' 1 we first sample 7j|{7}',tr^ and 

then sample p>j\Tj^ {Tj',lJ,j'}j'^j,a-'^. (Note that pUj is 
integrated out while sampling Tj.) More precisely, we 
compute the residual 

Rj = Y-T,r=,^r^,9{X;Tr,pr)- ( 8 ) 

(i) 

Using the residual R) as the target, Chipman et al. 

( 2010 ) sample Tj'’'^ by proposing local changes to ■ 

Finally, pLj is sampled from a Gaussian distribution 
conditioned on Tj, {Tj', Tti® procedure is 

summarized in Algorithm 1 . 

3.2 Existing samplers for BART 

To sample Tj, Chipman et al. (2010) use the MCMC al¬ 
gorithm proposed by Chipman et al. (1998). This algo¬ 
rithm, which we refer to as CGM. CGM is a Metropolis- 
within-Gibbs sampler that randomly chooses one of the 
following four moves: grow (which randomly chooses a 
leaf node and splits it further into left and right chil¬ 
dren), prune (which randomly chooses an internal node 
where both the children are leaf nodes and prunes the 
two leaf nodes, thereby making the internal node a leaf 
node), change (which changes the decision rule at a 
randomly chosen internal node), swap (which swaps 
the decision rules at a parent-child pair where both the 
parent and child are internal nodes). There are two 
issues with the CGM sampler: (1) the CGM sampler 
makes local changes to the tree, which is known to af¬ 
fect mixing when computing the posterior over a single 
decision tree (Wu et ah, 2007). Chipman et al. (2010) 
claim that the default hyper-parameter values encour¬ 
age shallower trees and hence mixing is not affected 
significantly. However, if one wishes to use BART on 
large datasets where individual trees are likely to be 
deeper, the CGM sampler might suffer from mixing is¬ 
sues. (2) The change and swap moves in CGM sampler 
are computationally expensive for large datasets that 
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Algorithm 1 Bayesian backfitting MCMC for posterior inference in BART 
1: Inputs: Training data (X, Y), BART hyperparameters (z/, g,/c, m, Og,/3s) 

2: Initialization: For all j, set = {e}, = 0} and sample 

3: for i = 1 : max_iter do 

4: Sample > sample from inverse gamma distribution 

5: for j = 1 : m do 

6: Compute residual > using (8) 

7: Sample > using CGM, GrowPrune or PG 

8: Sample > sample from Gaussian distribution 


involve deep trees (since they involve re-computation of 
all likelihoods in the subtree below the top-most node 
affected by the proposal). For computational efficiency, 
Pratola et al. (2013) propose using only the grow and 
prune moves; we will call this the GrowPrune sampler. 
However, as we illustrate in section 4, the GrowPrune 
sampler can inefficiently explore the posterior in scenar¬ 
ios where there are multiple possible trees that explain 
the observations equally well. In the next section, we 
present a novel sampler that addresses both of these 
concerns. 

3.3 PG sampler for BART 

Recall that Chipman et al. (2010) sample using 

as the target by proposing local changes to 7)^*”^^- 
It is natural to ask if it is possible to sample a com¬ 
plete tree rather than just local changes. Indeed, 
this is possible by marrying the sequential representa¬ 
tion of the tree proposed by Lakshminarayanan et al. 
(2013) (see section 2.5) with the Particle Markov Ghain 
Monte Carlo (PMCMC) framework (Andrieu et ah, 
2010) where an SMC algorithm (particle filter) is used 
as a high-dimensional proposal for MCMC. The PG 
sampler is implemented using the so-called conditional 
SMG algorithm (instead of the Metropolis-Hastings 
samplers described in Section 3.2) in line 7 of Algo¬ 
rithm 1. At a high level, the conditional SMG algorithm 
is similar to the SMC algorithm proposed by Lakshmi¬ 
narayanan et al. (2013), except that one of the particles 
is clamped to the current tree ■ 

Before describing the PG sampler, we derive the con¬ 
ditional posterior Tj\{7j', fij' bvJ,cr^ Y,X. Let N{r]) 
denote the set of data point indices n G {1,..., N} 
such that Xn G Bn- Slightly abusing the notation, let 
Rn^t]) denote the vector containing residuals of data 
points in node rj. Given R :=Y — 75'> /^l')) 

it is easy to see that the conditional posterior over 
7) ,Mj is given by 

piRjtP'j |{^'q P'j'} 7 ^ “^) 

cxp(7;|X) A/'(R„|Ai^,cr^)A/'(Ai^|m^,cr^). 

?7Gleaves(Tj) n^N{ri) 


Let 7r(7)) denote the conditional posterior over 7). Inte¬ 
grating out fj, and using (5) for p{Tj |X), the conditional 
posterior 7r(7}) is 

7r(7j) = p(7}|{7}',fJ ,Y,X) 

(xp{Tj\X) p(i?^(,,)|cr^m^,cr2), (9) 

77Gleaves(Tj) 

where p(i?jv(,,) |cr^, cr^) denotes the marginal likeli¬ 
hood at a node rj, given by 

— I Rl'iRnlPr)-) 

neN{ri) 

The goal is to sample from the (conditional) posterior 
distribution Tr(Tj). Lakshminarayanan et al. (2013) 
presented a top-down particle filtering algorithm that 
approximates the posterior over decision trees. Since 
this SMC algorithm can sample complete trees, it is 
tempting to substitute an exact sample from 7r(7}) with 
an approximate sample from the particle filter. How¬ 
ever, Andrieu et al. (2010) observed that this naive 
approximation does not leave the joint posterior dis¬ 
tribution (7) invariant, and so they proposed instead 
to generate a sample using a modified version of the 
SMC algorithm, which they called the conditional-SMG 
algorithm, and demonstrated that this leaves the joint 
distribution (7) invariant. (We refer the reader to the 
paper by Andrieu et al. (2010) for further details about 
the PMCMC framework.) By building off the top- 
down particle filter for decision trees, we can define a 
conditional-SMC algorithm for sampling from 7r(7j). 

The conditional-SMC algorithm is an MH kernel with 
7r(70) as its stationary distribution. To reduce clutter, 
let T* denote the old tree and T denote the tree we 
wish we to sample. The conditional-SMC algorithm 
samples T from a C-particle approximation of 
which can be written as J2c=i^(*^)^T{c} where T(c) 
denotes the tree (particle) and the weights sum to 
1, that is, = 1- 
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SMC proposal: Each particle T(c) is the end product 
of a sequence of partial trees T(o){c), T{i)[c), 7(2)(c), • ■ •, 
and the weight w{c) reflects how well the tree ex¬ 
plains the residual R. One of the particles, say the 
first particle, without loss of generality, is clamped to 
the old tree T* at all stages of the particle filter, i.e., 
= T^t)- At stage t, the remaining C — 1 parti¬ 
cles are sampled from the sequential generative process 
Prt(-1 T(t-i)ic)) described in section 2.5. Unlike state 
space models where the length of the latent state se¬ 
quence is fixed, the sampled decision tree sequences 
may be of different length and could potentially be 
deeper than the old tree T* ■ Hence, whenever E(() = 0, 
we set Y’v(t){T(t)\T(t-i)) = i-®-, T(t) = 7(t-i)- 

SMC weight update: Since the prior is used as the 
proposal, the particle weight W(t) (c) is multiplicatively 
updated with the ratio of the marginal likelihood of 
7(t)(c) to the marginal likelihood of 7(t_i)(c). The 
marginal likelihood associated with a (partial) tree T 
is a product of the marginal likelihoods associated with 
the leaf nodes of T defined in (10). Like Lakshmi- 
narayanan et al. (2013), we treat the eligible nodes E(t) 
as leaf nodes while computing the marginal likelihood 
for a partial tree 7(t). Plugging in (10), the SMC weight 
update is given by (11) in Algorithm 2. 

Resampling: The resampling step in the conditional- 
SMC algorithm is slightly different from the typical 
SMC resampling step. Recall that the first particle is 
always clamped to the old tree. The remaining C — 1 
particles are resampled such that the probability of 
choosing particle c is proportional to its weight W(t)ic). 
We used multinomial resampling in our experiments, 
although other resampling strategies are possible. 

When none of the trees contain eligible nodes, the 
conditional-SMC algorithm stops and returns a sam¬ 
ple from the particle approximation. Without loss of 
generality, we assume that the particle is returned. 
The PG sampler is summarized in Algorithm 2. 

The computational complexity of the conditional-SMC 
algorithm in Algorithm 2 is similar to that of the top- 
down algorithm (Lakshminarayanan et ah, 2013, §3.2). 
Even though the PG sampler has a higher per-iteration 
complexity in general compared to GrowPrune and 
GGM samplers, it can mix faster since it can propose 
a completely different tree that explains the data. The 
GrowPrune sampler requires many iterations to explore 
multiple modes (since a prune operation is likely to 
be rejected around a mode). The CGM sampler can 
change the decisions at internal nodes; however, it is 
inefficient since a change in an internal node that leaves 
any of the nodes in the subtree below empty will be 
rejected. We demonstrate the competitive performance 
of PG in the experimental section. 


4 Experimental evaluation 

In this section, we present experimental comparisons 
between the PG sampler and existing samplers for 
BART. Since the main contribution of this work is a 
different inference algorithm for an existing model, we 
just compare the efficiency of the inference algorithms 
and do not compare to other models. BART has been 
shown to demonstrate excellent prediction performance 
compared to other popular black-box non-linear re¬ 
gression approaches; we refer the interested reader to 
Ghipman et al. (2010). 

We implemented all the samplers in Python and ran 
experiments on the same desktop machine so that the 
timing results are comparable. The scripts can be 
downloaded from the authors’ webpages. We set the 
number of particles C = 10 for computational efficiency 
and max-stages = 5000, following Lakshminarayanan 
et al. (2013), although the algorithm always terminated 
much earlier. 

4.1 Hypercube-U dataset 

We investigate the performance of the samplers on a 
dataset where there are multiple trees that explain the 
residual (conditioned on other trees). This problem is 
equivalent to posterior inference over a decision tree 
where the labels are equal to the residual. Hence, we 
generate a synthetic dataset where multiple trees are 
consistent with the observed labels. Intuitively, a local 
sampler can be expected to mix reasonably well when 
the true posterior consists of shallow trees; however, 
a local sampler will lead to an inefficient exploration 
when the posterior consists of deep trees. Since the 
depth of trees in the true posterior is at the heart of the 
mixing issue, we create synthetic datasets where the 
depth of trees in the true posterior can be controlled. 

We generate the hypercube-H dataset as follows: for 
each of the 2^ vertices of [—1,1]-®, we sample 10 data 
points. The x location of a data point is generated as 
X = V + e where v is the vertex location and e is a ran¬ 
dom offset generated as e ~ ^(0, O.l^/u). Each vertex 
is associated with a different function value and the 
function values are generated from A/'(0,3^). Finally 
the observed label is generated a,s y = f + e where 
/ denotes the true function value at the vertex and 
e ^ A/’(0,0.01^). Figure 2 shows a sample hypercube- 
2 dataset. As D increases, the number of trees that 
explains the observations increases. 

We fix m = l,as = 0.95 and set remaining BART 
hyperparameters to the default values. Since the true 
tree has 2^ leaves, we set /3s such that the expected 

The values of /3s for D = 2,3,4,5 and 7 are 
1.0,0.5, 0.4, 0.3 and 0.25 respectively. 
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Algorithm 2 Conditional-SMC algorithm used in the PG-BART sampler 
1: Inputs: Training data: features X, ‘target’ R > R denotes residual in BART 

2: Number of particles C 

3: Old tree T* (along with the partial tree sequence > '^( 2 ) > • ■ ■) 

4: Initialize: For c = 1 : C, set T(o)(c) = i?(o)(c) = {e} and 'r(o)(c) = K(o)(c) = 0 
5: For c = 1 : C, set weights W(o)(c) = p(i?y(e)|( t^, cr^) and VF(o) = J2c'^(o)(c) 

6: for t = 1 : max-stages do 

7: Set 7(t)(l) = > clamp the first particle to the partial tree ofT* at stage t 

8: for c = 2 : C do 

9: Sample 7(t)(c) from Pr(t)(- \T(t-i){c)) where 7(t)(c) := (T(t)(c),/«(t) (c), T(t) (c), £ 1 ( 4 ) (c)) > section 2.5 

10: for c = 1 : C do 

11: > //£'( 4 _i)(c) is non-empty, let rj denote the node popped from £'p_i)(c). 

12: Update weights: 


u'(t)(c) 


-JT- 

P[RN{ri)W^,rn^,af) 


if £l(j_i)(c) is empty or 77 is stopped, 
if rj is split. 


( 11 ) 


13: Compute normalization: lUp) = ^(t) ('-) 

14: Normalize weights: (Vc)’u;(t)(c) = ri;(t)(c)/VF(t) 

15: Set ji = 1 and for c = 2 : C, resample indices jc from PP(t)i.c.)5c' > resample all particles except the 

first 

16: (Vc) T(t){c) •<— T(t){jc)] w;(t)(c) ^ Wpr^jC 

17: if (Vc) i?( 7 )(c) = 0 then exit for loop 

return 7(t)(C') = (T(j)(C'), K(i)(C'), T(t)(C')) > return a sample from the approximation to 

line 7 of Algorithm 1 
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Figure 2: Hypercube-2 dataset: see main text for details. 

number of leaves is roughly 2^. We run 2000 iterations 
of MCMC. Figure 3 illustrates the posterior trace plots 
for D = A. (See supplemental material for additional 
posterior trace plots.) We observe that PG converges 
much faster to the posterior in terms of number of leaves 
as well as the test MSE. We observe that GrowPrune 
sampler tends to overestimate the number of leaves; the 
low value of train MSE indicates that the GrowPrune 
sampler is stuck close to a mode and is unable to explore 
the true posterior. Pratola (2013) has reported similar 
behavior of GrowPrune sampler on a different dataset 
as well. 


Figure 3: Results on Hypercube-4 dataset. 


We compare the algorithms by computing effective sam¬ 
ple size (ESS). ESS is a measure of how well the chain 
mixes and is frequently used to assess performance of 
MCMC algorithms; we compute ESS using R-CODA 
(Plummer et ah, 2006). We discard the hrst 1000 itera¬ 
tions as burn-in and use the remaining 1000 iterations 
to compute ESS. Since the per iteration cost of gener¬ 
ating a sample differs across samplers, we additionally 
report ESS per unit time. The ESS (computed using 
log-likelihood values) and ESS per second (ESS/s) val- 
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D 

CGM 

GrowPrune 

PG 

2 

751.66 

473.57 

259.11 

3 

762.96 

285.2 

666.71 

4 

14.01 

11.76 

686.79 

5 

2.92 

1.35 

667.27 

7 

1.16 

1.78 

422.96 


Table 1: Comparison of ESS for CGM, GrowPrune and 
PG samplers on Hypercube-_D dataset. 


D 

CGM 

GrowPrune 

PG 

2 

157.67 

114.81 

7.69 

3 

93.01 

26.94 

11.025 

4 

0.961 

0.569 

5.394 

5 

0.130 

0.071 

1.673 

7 

0.027 

0.039 

0.273 


Table 2: Comparison of ESS/s (ESS per second) for CGM, 
GrowPrune and PG samplers on Hypercube-D dataset. 

ues are shown in Tables 1 and 2 respectively. When the 
true tree is shallow {D — 2 and D = 3), we observe that 
CGM sampler mixes well and is computationally effi¬ 
cient. However, as the depth of the true tree increases 
{D = 4, 5, 7), PG achieves much higher ESS and ESS/s 
compared to CGM and GrowPrune samplers. 

4.2 Real world datasets 

In this experiment, we study the effect of the data 
dimensionality on mixing. Even when the trees are 
shallow, the number of trees consistent with the labels 
increases as the data dimensionality increases. Using 
the default BART prior (which promotes shallower 
trees), we compare the performance of the samplers on 
real world datasets of varying dimensionality. 

We consider the CaliforniaHouses, YearPredictionMSD 
and CTslices datasets used by Johnson and Zhang 
(2013). For each dataset, there are three training sets, 
each of which contains 2000 data points, and a single 
test set. The dataset characteristics are summarized 
in Table 3. 


Dataset 

Atrain 

Atest 

D 

CaliforniaHouses 

2000 

5000 

6 

YearPredictionMSD 

2000 

51630 

90 

CTslices 

2000 

24564 

384 


Table 3: Characteristics of datasets. 


We run each sampler using the three training datasets 
and report average ESS and ESS/s. All three samplers 
achieve very similar MSE to those reported by Johnson 
and Zhang (2013). The average number of leaves in the 


posterior trees was found to be small and very similar 
for all the samplers. Tables 4 and 5 respectively present 
results comparing ESS and ESS/s of the different sam¬ 
plers. As the data dimensionality increases, we observe 
that PG outperforms existing samplers. 


Dataset 

CGM 

GrowPrune 

PG 

CaliforniaHouses 

18.956 

34.849 

76.819 

YearPredictionMSD 

29.215 

21.656 

76.766 

CTslices 

2.511 

5.025 

11.838 


Table 4: Comparison of ESS for CGM, GrowPrune and 
PG samplers on real world datasets. 


Dataset 

CGM 

xlO-3 

GrowPrune 

xlO-3 

PG 

xlO-3 

CaliforniaHouses 

1.967 

48.799 

16.743 

YearPredictionMSD 

2.018 

7.029 

14.070 

CTslices 

0.080 

0.615 

2.115 


Table 5: Comparison of ESS/s for CGM, GrowPrune and 
PG samplers on real world datasets. 

5 Discussion 

We have presented a novel PG sampler for BART. Un¬ 
like existing samplers which make local moves, PG can 
propose complete trees. Experimental results confirm 
that PG dramatically increases mixing when the true 
posterior consists of deep trees or when the data di¬ 
mensionality is high. While we have presented PG only 
for the BART model, it is applicable to extensions of 
BART that use a different likelihood model as well. PG 
can also be used along with other priors for decision 
trees, e.g., those of Denison et al. (1998), Wu et al. 
(2007) and Lakshminarayanan et al. (2014). Backward 
simulation (Lindsten and Schon, 2013) and ancestral 
sampling (Lindsten et al., 2012) have been shown to sig¬ 
nificantly improve mixing of PG for state-space models. 
Extending these ideas to PG-BART is a challenging 
and interesting future direction. 
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Supplementary material 

A Results on hypercube—D dataset 
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Figure 4: Results on Hypercube-2 dataset. 
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Figure 5: Results on Hypercube-3 dataset. 










































